In many least squares problems, QR decomposition is employed

•  Factor  matrix  A  into  unitary  matrix  Q  and  upper  triangular  matrix  R  such 
that  A  =  QR 
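
As a reminder of why the factorization helps with least squares, the standard reduction is worked out below. It assumes A is m x n with m >= n and full column rank, and that R is partitioned into an n x n upper triangular block R_1 above a zero block; the names R_1 and b are introduced here for illustration only.

```latex
% Assumes A is m x n with m >= n and full column rank; Q is m x m unitary.
\min_{x} \lVert Ax - b \rVert_2, \qquad
A = QR = Q \begin{bmatrix} R_1 \\ 0 \end{bmatrix}

\lVert Ax - b \rVert_2 = \lVert Q^{H}(Ax - b) \rVert_2
  = \left\lVert \begin{bmatrix} R_1 x \\ 0 \end{bmatrix} - Q^{H} b \right\rVert_2

% The minimizer solves a triangular system by back substitution.
R_1 x = \left( Q^{H} b \right)_{1:n}
```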

Two primary algorithms are available to compute the QR decomposition

•  Givens  rotations 

♦ Pre-multiplying rows i-1 and i of a matrix by a 2x2 Givens rotation matrix will zero the
entry (i, j)

• Householder reflections

♦ When a column of A is multiplied by an appropriate Householder reflection, it is possible
to zero all the subdiagonal entries in that column
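
The Givens bullet above can be made concrete with a minimal sketch in plain Python. The helper names `givens` and `apply_givens` are hypothetical, the matrix is stored as a list of rows, and real-valued entries are assumed.

```python
import math

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to rotate
    return a / r, b / r

def apply_givens(A, i, j):
    """Rotate rows i-1 and i of A in place so that A[i][j] becomes zero."""
    c, s = givens(A[i - 1][j], A[i][j])
    for k in range(len(A[0])):
        top, bot = A[i - 1][k], A[i][k]
        A[i - 1][k] = c * top + s * bot
        A[i][k] = -s * top + c * bot

A = [[3.0, 1.0],
     [4.0, 2.0]]
apply_givens(A, 1, 0)  # zero entry (1, 0) using rows 0 and 1
```

Only the two rows involved are modified, which is what lets different blocks of rows be rotated independently.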
NRL Problem Statement


■ Want to minimize the latency incurred when computing the QR
decomposition of a matrix A and maintain performance across
different platforms

■  Algorithm  consists  of  parallel  Givens  task  and  serial  Householder  task 

■  Parallel  Givens  task 

•  Allocate  blocks  of  rows  to  different  processors.  Each  processor  uses 
Givens  rotations  to  zero  all  available  entries  within  block  such  that 

♦ A(i, j) = 0 only if A(i-1, j-1) = 0 and A(i, j-1) = 0

■  Serial  Householder  task 

•  Once  Givens  task  terminates,  all  distributed  rows  are  sent  to  root 
processor  which  utilizes  Householder  reflections  to  zero  remaining  entries 
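
The two tasks can be simulated serially to see how they fit together. This is a rough sketch under stated assumptions, not the authors' implementation: the row blocks are processed one after another instead of on separate processors, no data is communicated, and the function names (`givens_task`, `householder_task`, `hybrid_r`) and the parameters `nblocks` and `c` are chosen here for illustration.

```python
import math

def rotate_rows(A, i, j):
    """Givens rotation on rows i-1 and i that zeroes A[i][j] in place."""
    a, b = A[i - 1][j], A[i][j]
    r = math.hypot(a, b)
    if r == 0.0:
        return
    c, s = a / r, b / r
    for k in range(len(A[0])):
        top, bot = A[i - 1][k], A[i][k]
        A[i - 1][k] = c * top + s * bot
        A[i][k] = -s * top + c * bot

def givens_task(A, first, last, ncols):
    """Zero what the dependency rule allows inside rows first..last (inclusive)."""
    for j in range(min(ncols, len(A[0]))):
        # In column j only rows below first + j can be cleared, bottom-up,
        # because rows i-1 and i must already be zero in column j-1.
        for i in range(last, first + j, -1):
            rotate_rows(A, i, j)

def householder_task(A):
    """Zero every remaining subdiagonal entry, one column at a time."""
    m, n = len(A), len(A[0])
    for j in range(min(m - 1, n)):
        x = [A[i][j] for i in range(j, m)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue
        v = x[:]
        v[0] += math.copysign(norm, x[0])
        beta = 2.0 / sum(t * t for t in v)
        for k in range(j, n):
            w = sum(v[i] * A[j + i][k] for i in range(m - j))
            for i in range(m - j):
                A[j + i][k] -= beta * v[i] * w

def hybrid_r(A, nblocks, c):
    """Return the R factor: Givens per row block, then Householder on the whole."""
    R = [row[:] for row in A]
    m = len(R)
    size = -(-m // nblocks)  # rows per block, rounded up
    for first in range(0, m, size):
        givens_task(R, first, min(first + size, m) - 1, c)
    householder_task(R)
    return R

A = [[2.0, 1.0, 3.0], [1.0, 4.0, 1.0], [3.0, 1.0, 2.0],
     [1.0, 2.0, 5.0], [4.0, 1.0, 1.0], [2.0, 3.0, 2.0]]
R = hybrid_r(A, nblocks=2, c=2)
```

Because every step is an orthogonal transformation, R^T R should match A^T A up to rounding, which gives a quick sanity check without forming Q.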
Givens Task

Each processor uses Givens
rotations  to  zero  entries  up  to  the 
topmost  row  in  the  assigned  group 

Once  task  is  complete,  rows  are 
returned  to  the  root  processor 

Givens  rotations  are  accumulated  in 
a  separate  matrix  before  updating 
all  of  the  columns  in  the  array 

• Avoids updating columns that will
not be used by an immediately
following Givens rotation

•  Saves  significant  fraction  of 
computational  flops 
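
The accumulation idea can be sketched as follows, assuming one block of rows held as a list of lists. Rotations are applied immediately only to the leading `ncols` columns, which later rotations read, and are recorded in a separate matrix `G`. The trailing columns are then updated once. The function name and the dense accumulator are assumptions made for illustration; a dense `G` shows the data flow but does not by itself reproduce the flop savings described above.

```python
import math

def givens_accumulated(block, ncols):
    """Zero entries in the first ncols columns; update the other columns once."""
    m, n = len(block), len(block[0])
    G = [[1.0 if i == k else 0.0 for k in range(m)] for i in range(m)]
    for j in range(ncols):
        for i in range(m - 1, j, -1):
            a, b = block[i - 1][j], block[i][j]
            r = math.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            # Rotate only the leading columns that later rotations will read.
            for k in range(j, ncols):
                top, bot = block[i - 1][k], block[i][k]
                block[i - 1][k] = c * top + s * bot
                block[i][k] = -s * top + c * bot
            # Record the same rotation in the accumulator.
            for k in range(m):
                top, bot = G[i - 1][k], G[i][k]
                G[i - 1][k] = c * top + s * bot
                G[i][k] = -s * top + c * bot
    # One matrix product brings the trailing columns up to date.
    trailing = [[sum(G[i][p] * block[p][k] for p in range(m))
                 for k in range(ncols, n)] for i in range(m)]
    for i in range(m):
        block[i][ncols:] = trailing[i]
    return block

B = [[4.0, 1.0, 2.0, 0.0, 3.0],
     [2.0, 3.0, 1.0, 5.0, 1.0],
     [1.0, 2.0, 4.0, 1.0, 2.0],
     [3.0, 1.0, 1.0, 2.0, 6.0]]
out = givens_accumulated([row[:] for row in B], 2)
```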
Householder Task

Root processor utilizes
Householder reflections to zero
remaining entries in Givens
columns

By computing a priori where zeroes
will be after each Givens task is
complete, root processor can
perform a sparse matrix multiply
when performing a Householder
update for additional speed-up

• Householder update is A = A - β v v^T A

Householder  update  involves 
matrix-vector  multiplication  and  an 
outer  product  update 

•  Makes  extensive  use  of  BLAS 
routines 
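
The two steps named above, a matrix-vector product followed by an outer-product update, can be sketched in plain Python. The symbols `v`, `beta` and `w` follow the usual textbook convention and are assumptions here; in a BLAS-based code the two steps would typically map to GEMV-style and GER-style calls. Real-valued entries are assumed.

```python
import math

def householder_vector(x):
    """Return (v, beta) such that (I - beta v v^T) x has zeros below its first entry."""
    norm = math.sqrt(sum(t * t for t in x))
    v = list(x)
    v[0] += math.copysign(norm, x[0])  # sign choice avoids cancellation
    vv = sum(t * t for t in v)
    beta = 0.0 if vv == 0.0 else 2.0 / vv
    return v, beta

def householder_update(A, v, beta):
    """Apply A <- A - beta * v * (v^T A) in place."""
    m, n = len(A), len(A[0])
    # Matrix-vector product: w = A^T v
    w = [sum(A[i][k] * v[i] for i in range(m)) for k in range(n)]
    # Outer-product (rank-1) update: A <- A - beta * v * w^T
    for i in range(m):
        for k in range(n):
            A[i][k] -= beta * v[i] * w[k]

A = [[3.0, 1.0],
     [4.0, 2.0]]
v, beta = householder_vector([row[0] for row in A])
householder_update(A, v, beta)  # first column becomes (-5, 0)
```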
Dependency Graph - Path 1

■ Algorithm must zero matrix entries
in such an order that previously
Dependency Graph - Path 2

■ By traversing dependency graph in
zig-zag  fashion,  cache  line  reuse  is 
Parameterized Algorithms: Memory Hierarchy

[Figure: memory hierarchy diagram; recoverable labels: CPU, Second Level Cache]

Parameterized algorithms make
effective use of the memory hierarchy

• Improve spatial locality of memory
references by grouping together
data used at the same time

• Improve temporal locality of memory
references by using data retrieved
from cache as many times as
possible before cache is flushed

Portable performance is the primary
objective
Givens Parameter

■  Parameter  c  controls  the  number  of 
columns  in  Givens  task 

■  Determines  how  many  matrix 
entries  can  be  zeroed  before  rows 
are  flushed  from  cache 
Householder Parameter

Parameter h controls the number of
columns  zeroed  by  Householder 
reflections  at  the  root  processor 

If  h  is  large,  the  root  processor 
performs  more  serial  work,  avoiding 
the  communication  costs 
associated  with  the  Givens  task 

However, the other processors sit
idle longer, decreasing the
efficiency  of  the  algorithm 
Work Partition Parameters

Parameters v and w allow operator
to  assign  rows  to  processors  such 
that  the  work  load  is  balanced  and 
processor  idle  time  is  minimized 
Results

Server Computer (1)


HP  Superdome 

SPAWAR  in  San  Diego,  CA 


48  550-MHz  PA-RISC  8600  CPUs 
1.5  MB  on-chip  cache  per  CPU 
1  GB  RAM  /  Processor 
Server Computer (2)


SGI O3000

NRL in Washington, D.C.


512 R12000 processors running at 400 MHz
8 MB on-chip cache
Up to 2 GB RAM / Processor




Embedded  Computer 


■ 8 Motorola 7400 processors with
AltiVec  units 
Effect of c

[Figure: effect of c; legend: Mercury, SGI O3000]
Effect of h

[Figure: time (msec) versus h, for h from 0 to 50; legend: Mercury, SGI O3000, HP Superdome]


100  X  100  array 
4  processors 
c  =  63,/?  =  12 
Effect of w

[Figure: effect of w; legend: Mercury, SGI O3000, HP Superdome]
Performance vs Matrix Size


4  processors 
Scalability


500  X  500  array 
Comparison to SCALAPACK


9.4  ms 


For matrix sizes on the order of 100
by 100, the Hybrid QR algorithm
outperforms the SCALAPACK
library routine PSGEQRF by 16%

Data  distributed  in  block  cyclic 
fashion  before  executing  PSGEQRF 
Conclusion


■ Hybrid QR algorithm using a combination of Givens rotations and
Householder reflections is an efficient way to compute the QR decomposition
for small arrays on the order of 100 x 100

■ Algorithm implemented on SGI O3000 and HP Superdome servers as
well as Mercury G4 embedded computer

■ Mercury implementation lacked optimized BLAS routines and as a
consequence performance was significantly slower

■ Algorithm has applications to signal processing problems such as
adaptive nulling where strict latency targets must be satisfied


